Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

The Nature of Living Things

Table 14.2 will reveal some striking instances, the genomes of amoebae and lungfish

considerably exceeding in size those of ourselves, for example.

Before delving into this question more deeply, three relatively trivial factors affecting

the C-value should be pointed out. The first is experimental uncertainty and

ambiguity in the precise definition of the C-value. Second, in some cases, genome

size is merely estimated from the total mass of DNA in a cell. This makes the given

value highly dependent on polyploidy, unusual in mammals but not in amphibians

and fish, and rather common in plants. For example, the lungfish, which has a

conspicuously large C-value, is known to be tetraploid. Amoebae, which apparently

have an even larger C-value, are likely to be polyploid and, moreover, the amount

of DNA found in an amoeba cell may well be inflated by the remains of genetic

material of recently ingested prey. Care should therefore be taken to ascertain the

amount of genetic material corresponding to the haploid genome for the purposes

of comparison. The third factor is the presence of enormous quantities of repetitive

DNA in many eukaryotic genomes. These repetitive sequences include retrotransposons,

vestiges of retroviruses, and so forth. Probably about half of the human

genome can be accounted for in this way, and it seems not unreasonable to consider

this as “junk” (although it appears to play a rôle in the condensation of the DNA into

heterochromatin; see Sect. 14.4.4). ²⁶
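
The first two corrections described above amount to simple arithmetic, which can be sketched as follows. This is a minimal illustration, not a method given in the text: the function names and the example inputs are hypothetical, and the conversion factor of roughly 0.978 × 10⁹ base pairs per picogram of DNA is the commonly used approximation rather than a value taken from this chapter.

```python
# Approximate conversion commonly used in genome-size work (an assumption here,
# not a figure from the text): 1 pg of DNA corresponds to about 0.978e9 bp.
BP_PER_PG = 0.978e9


def haploid_genome_bp(dna_mass_pg: float, ploidy: int) -> float:
    """Estimate haploid genome size (bp) from the DNA mass of one nucleus.

    dna_mass_pg: measured nuclear DNA mass per cell, in picograms.
    ploidy: number of chromosome sets in that cell (2 for diploid, 4 for tetraploid).
    """
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return dna_mass_pg * BP_PER_PG / ploidy


def non_repetitive_bp(haploid_bp: float, repeat_fraction: float) -> float:
    """Remove an assumed repetitive fraction (0..1) from a haploid genome size."""
    if not 0.0 <= repeat_fraction <= 1.0:
        raise ValueError("repeat_fraction must lie between 0 and 1")
    return haploid_bp * (1.0 - repeat_fraction)


if __name__ == "__main__":
    # Hypothetical inputs: the same DNA mass read as diploid or as tetraploid.
    for ploidy in (2, 4):
        size = haploid_genome_bp(dna_mass_pg=8.0, ploidy=ploidy)
        print(ploidy, f"{size:.3e}", f"{non_repetitive_bp(size, 0.5):.3e}")
```

The same measured DNA mass yields a haploid estimate half as large if the cell is tetraploid rather than diploid, which is why the ploidy must be established before genome sizes are compared. Contamination by foreign DNA, as suspected for amoebae, is not addressed by this sketch.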

Is There a G-Value Paradox?

By correcting for polyploidy and repetitive junk, one arrives at the quantities of

DNA involved in protein synthesis (both the genes themselves and the regulatory

overhead). In some cases, the actual number of genes (the G-value) can be estimated

with reasonable confidence; in other cases, the simple application of a compression

algorithm (Sect. 7.4) can be used to provide a minimal description (an approximation

to the algorithmic information content; see Chap. 11), which correlates much better than the raw C-value

with presumed organismal complexity (as measured, for example, by the number of

different cell types). Where gene number estimates are available, however, the more

complex organisms do not seem to have enough genes. Especially if the figure for H.

sapiens has to be revised downward to a mere 20 000, we end up with fewer genes

than A. thaliana, for example! This is the so-called G-value paradox. Its resolution

would appear to lie with enhanced alternative splicing possibilities for more complex

organisms. We humans appear to have the largest intron sizes, for example. ²⁷
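
The compression approach mentioned above can be illustrated with a short sketch. It assumes a general-purpose compressor (zlib from the Python standard library) as a stand-in; the algorithm of Sect. 7.4 is not reproduced here, and the function name and the toy sequences are hypothetical. The compressed length is only a crude upper bound on the algorithmic information content.

```python
import random
import zlib


def compressed_size_bytes(sequence: str) -> int:
    """Return the zlib-compressed size, in bytes, of a nucleotide string.

    A shorter result suggests that the sequence is more redundant. This is only
    an upper bound on the length of a minimal description of the sequence.
    """
    return len(zlib.compress(sequence.upper().encode("ascii"), 9))


if __name__ == "__main__":
    # Toy data (hypothetical): a highly repetitive sequence and a pseudo-random
    # sequence of the same length.
    repetitive = "ACGT" * 1000
    rng = random.Random(0)
    shuffled = "".join(rng.choices("ACGT", k=len(repetitive)))
    print(len(repetitive), compressed_size_bytes(repetitive))
    print(len(shuffled), compressed_size_bytes(shuffled))
```

Two sequences of equal length can thus differ greatly in compressed size, which is the sense in which a compressed description discounts repetitive DNA. A compressor designed for nucleotide sequences would presumably give tighter estimates than this general-purpose one.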

26 Regarding the remainder, about 5% is considered to be conserved (by comparison with the

mouse); 1.2% is estimated to be used for coding proteins, and the remaining 3.8% is referred to as

“noncoding”, although conservation of sequence is taken to imply a significant function (it seems

very probable that this “noncoding” DNA is used to encode the small interfering RNA used to

supplement protein-based transcription factors as regulatory elements). That still leaves the enigma

of the remaining 40–50% that is neither repetitive nor coding in any sense understood at present.

27 Taft et al. (2007). Note the connexion between alternative splicing and Tonegawa’s mechanism

for generating B-cell lymphocyte (and hence antibody) diversity in the immune system (Sect. 14.6).